
Author Search Result

[Author] Kiyohiro SHIKANO (45 hits)

Showing 21-40 of 45 hits

  • FOREWORD

    Kiyohiro SHIKANO
    FOREWORD
    Vol: E88-D No:3, Page(s): 365-365
  • Interface for Barge-in Free Spoken Dialogue System Using Nullspace Based Sound Field Control and Beamforming

    Shigeki MIYABE, Hiroshi SARUWATARI, Kiyohiro SHIKANO, Yosuke TATEKURA
    PAPER-Speech/Audio Processing
    Vol: E89-A No:3, Page(s): 716-726

    In this paper, we describe a new interface for a barge-in free spoken dialogue system that combines multichannel sound field control and beamforming, in which the response sound from the system is canceled out at the microphone points. The conventional method prevents the user from moving because the system forces the user to stay at a fixed position where the response sound is reproduced. Since the proposed method does not set control points for reproducing the response sound to the user, the user is free to move. Furthermore, relaxing the strict reproduction requirement for the response sound allows us to design a stable system with fewer loudspeakers than the conventional method requires. The proposed method shows higher performance in speech recognition experiments.
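
    The null-space idea can be sketched in a few lines: at a given frequency, any loudspeaker driving vector that lies in the null space of the loudspeaker-to-microphone transfer matrix produces zero response sound at the microphones. The following is a minimal single-frequency sketch, not the authors' implementation; the transfer matrix is a random placeholder where a measured one would be used, and the combination with beamforming is omitted.

```python
import numpy as np
from scipy.linalg import null_space

# At a single frequency: C[i, j] = transfer function from loudspeaker j
# to microphone i (random placeholders here; measured in practice).
rng = np.random.default_rng(0)
n_mics, n_speakers = 2, 6
C = (rng.standard_normal((n_mics, n_speakers))
     + 1j * rng.standard_normal((n_mics, n_speakers)))

# Orthonormal basis of driving vectors that produce zero sound at every mic.
N = null_space(C)                      # shape: (n_speakers, n_speakers - n_mics)

# Project a nominal driving vector d onto the null space, so the system's
# response sound is canceled at the microphones but still radiated elsewhere.
d = rng.standard_normal(n_speakers) + 1j * rng.standard_normal(n_speakers)
d_null = N @ (N.conj().T @ d)          # null-space component of d

print(np.max(np.abs(C @ d_null)))      # ~0: no response sound at the mics
```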

  • Fast-Convergence Algorithm for Blind Source Separation Based on Array Signal Processing

    Hiroshi SARUWATARI, Toshiya KAWAMURA, Tsuyoki NISHIKAWA, Kiyohiro SHIKANO
    LETTER-Convolutive Systems
    Vol: E86-A No:3, Page(s): 634-639

    We propose a new algorithm for blind source separation (BSS) in which independent component analysis (ICA) and beamforming are combined to resolve the slow convergence of the optimization in ICA. The proposed method consists of two parts: frequency-domain ICA with direction-of-arrival (DOA) estimation, and null beamforming based on the estimated DOA. Alternating the learning between ICA and beamforming realizes fast and accurate convergence of the optimization. The results of signal separation experiments reveal that the separation performance of the proposed algorithm is superior to that of the conventional ICA-based BSS method.
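
    A rough sketch of the alternation is given below: a frequency-domain ICA update refines the unmixing matrix, a DOA estimate is read off the estimated mixing matrix (the pseudoinverse of the unmixing matrix), and a null beamformer built from that DOA re-initializes the next pass. Everything here (two-microphone geometry, step size, score function, blending rule, toy data) is an illustrative assumption, not the paper's exact algorithm.

```python
import numpy as np

def ica_update(W, X, mu=0.1):
    """One natural-gradient ICA step in a single frequency bin.
    W: (n_src, n_mic) unmixing matrix, X: (n_mic, n_frames) observations."""
    Y = W @ X
    phi = np.tanh(np.abs(Y)) * np.exp(1j * np.angle(Y))  # score function (assumed)
    n = Y.shape[1]
    return W + mu * (np.eye(W.shape[0]) - (phi @ Y.conj().T) / n) @ W

def null_beamformer(doas, freq, mic_pos, c=340.0):
    """Unmixing matrix that steers a null toward each interfering DOA."""
    A = np.exp(-2j * np.pi * freq * np.outer(mic_pos, np.sin(doas)) / c)
    return np.linalg.pinv(A)   # inverting the steering matrix places the nulls

# Alternate ICA refinement and beamformer re-initialization (toy data; real
# use would take STFT frames of the observed mixture in each bin).
rng = np.random.default_rng(0)
mic_pos = np.array([0.0, 0.04])        # 2-mic array, 4 cm spacing (assumed)
X = rng.standard_normal((2, 512)) + 1j * rng.standard_normal((2, 512))
W = null_beamformer(np.array([-0.5, 0.6]), freq=1000.0, mic_pos=mic_pos)
for _ in range(20):
    W = ica_update(W, X)
    A_est = np.linalg.pinv(W)                    # estimated mixing matrix
    phase = np.angle(A_est[1] / A_est[0])        # inter-mic phase per source
    doas = np.arcsin(np.clip(-phase * 340.0 / (2 * np.pi * 1000.0 * 0.04), -1, 1))
    W = 0.5 * W + 0.5 * null_beamformer(doas, 1000.0, mic_pos)  # blend (assumed)
```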

  • A Speech Dialogue System with Multimodal Interface for Telephone Directory Assistance

    Osamu YOSHIOKA, Yasuhiro MINAMI, Kiyohiro SHIKANO
    PAPER
    Vol: E78-D No:6, Page(s): 616-621

    This paper describes a multimodal dialogue system employing speech input. This system uses three input methods (a speech recognizer, a mouse, and a keyboard) and two output methods (a display and sound). The speech recognizer employs an algorithm for large-vocabulary speaker-independent continuous speech recognition based on the HMM-LR technique. The system is implemented for telephone directory assistance to evaluate the speech recognition algorithm and to investigate the variations in speech structure that users utter to computers. Speech input is used in a multimodal environment, and dialogue data between computers and users are also collected. Twenty telephone-number retrieval tasks are used to evaluate this system. In the experiments, all users are equally trained in using the dialogue system with an interactive guidance system implemented on a workstation. Simplified city maps that indicate subscriber names and addresses are used to reduce the implicit restrictions imposed by written sentences, thus allowing each user to develop his or her own forms of expression. The task completion rate is 99.0%, and approximately 75% of the users say that they prefer this system to using a telephone book. Moreover, there is a significant decrease in nonkeyword usage, i.e., the use of words other than names and addresses, for users who receive more utterance practice.

  • Stable Learning Algorithm for Blind Separation of Temporally Correlated Acoustic Signals Combining Multistage ICA and Linear Prediction

    Tsuyoki NISHIKAWA, Hiroshi SARUWATARI, Kiyohiro SHIKANO
    PAPER
    Vol: E86-A No:8, Page(s): 2028-2036

    We propose a new stable algorithm for blind source separation (BSS) combining multistage ICA (MSICA) and linear prediction. MSICA is a method previously proposed by the authors, in which frequency-domain ICA (FDICA) for a rough separation is followed by time-domain ICA (TDICA) to remove residual crosstalk. For temporally correlated signals, TDICA must be used with a nonholonomic constraint to avoid the decorrelation effect of the holonomic constraint; however, stability cannot be guaranteed in the nonholonomic case. To solve this problem, linear predictors estimated from the signals roughly separated by FDICA are inserted before the holonomic TDICA as prewhitening, and dewhitening is performed after TDICA. The stability of the proposed algorithm is guaranteed by the holonomic constraint, while the pre/dewhitening prevents the decorrelation. Experiments in a reverberant room reveal that the algorithm achieves higher stability and separation performance.
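
    The pre/dewhitening wrapper can be illustrated as follows: linear-prediction coefficients are estimated from the roughly separated signal, the prediction-error (FIR) filter whitens it before the time-domain ICA stage, and the inverse all-pole filter restores the temporal correlation afterwards. The LPC order is an assumption, and a placeholder stands in for the holonomic TDICA stage.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_coefficients(x, order=12):
    """Estimate linear-prediction coefficients (autocorrelation method)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])   # normal equations R a = r
    return np.concatenate(([1.0], -a))            # prediction-error filter A(z)

def prewhiten(x, a):
    """FIR prediction-error filtering: flattens the source spectrum."""
    return lfilter(a, [1.0], x)

def dewhiten(e, a):
    """Inverse (all-pole) filtering: restores the temporal correlation."""
    return lfilter([1.0], a, e)

# Toy usage around a placeholder TDICA stage (hypothetical).
rng = np.random.default_rng(1)
rough = lfilter([1.0], [1.0, -0.9], rng.standard_normal(8000))  # colored signal
a = lpc_coefficients(rough)
whitened = prewhiten(rough, a)
separated = whitened                  # stand-in for the holonomic TDICA output
restored = dewhiten(separated, a)     # temporal correlation recovered
```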

  • Blind Separation and Deconvolution for Convolutive Mixture of Speech Combining SIMO-Model-Based ICA and Multichannel Inverse Filtering

    Hiroshi SARUWATARI, Hiroaki YAMAJO, Tomoya TAKATANI, Tsuyoki NISHIKAWA, Kiyohiro SHIKANO
    PAPER-Engineering Acoustics
    Vol: E88-A No:9, Page(s): 2387-2400

    We propose a new two-stage blind separation and deconvolution strategy for multiple-input multiple-output (MIMO)-FIR systems driven by colored sound sources, in which single-input multiple-output (SIMO)-model-based ICA (SIMO-ICA) and blind multichannel inverse filtering are combined. SIMO-ICA separates the mixed signals not into monaural source signals but into SIMO-model-based signals from independent sources as they are observed at the microphones. After separation by SIMO-ICA, a blind deconvolution technique for the SIMO model can be applied even when each source signal is temporally correlated and the mixing system has a nonminimum-phase property. The simulation results reveal that the proposed algorithm successfully achieves separation and deconvolution of a convolutive mixture of speech, and outperforms a number of conventional ICA-based BSD methods.

  • Sound Field Reproduction by Wavefront Synthesis Using Directly Aligned Multi Point Control

    Noriyoshi KAMADO, Haruhide HOKARI, Shoji SHIMADA, Hiroshi SARUWATARI, Kiyohiro SHIKANO
    PAPER-Engineering Acoustics
    Vol: E94-A No:3, Page(s): 907-920

    In this paper, we present a comparative study of directly aligned multipoint controlled wavefront synthesis (DMCWS) and wave field synthesis (WFS) for the realization of a high-accuracy sound reproduction system, assessing the amplitude, phase, and attenuation characteristics of the wavefronts generated by each method. First, for DMCWS, we derive an optimal control-line coordinate based on numerical analysis. Next, computer simulations reveal that the wavefront in DMCWS has wide applicability in both the spatial and frequency domains, with small amplitude and phase errors, particularly above the spatial aliasing frequency of WFS. We also clarify that the amplitude error in DMCWS behaves similarly to the well-known approximate expression for spatial decay in WFS, which makes the amplitude error in DMCWS easy to estimate. Finally, we developed a wavefront measurement system and used it to measure a DMCWS wavefront. The measurements clarify the frequency characteristics of a loudspeaker and show that DMCWS has wide applicability in the frequency domain in actual environments. From these findings, we conclude that DMCWS is advantageous compared with WFS.

  • Evaluation of Extremely Small Sound Source Signals Used in Speaking-Aid System with Statistical Voice Conversion

    Keigo NAKAMURA, Tomoki TODA, Hiroshi SARUWATARI, Kiyohiro SHIKANO
    PAPER-Rehabilitation Engineering and Assistive Technology
    Vol: E93-D No:7, Page(s): 1909-1917

    We have previously proposed a speaking-aid system for laryngectomees using a statistical voice conversion technique. In the proposed system, artificial speech articulated with extremely small sound source signals is detected with a Non-Audible Murmur (NAM) microphone, and the detected artificial speech is then converted into a more natural voice in a probabilistic manner. Although this system basically allows laryngectomees to speak while keeping the external source signals silent, it remains unclear how much these new sound source signals affect the converted speech quality. In this paper, we investigate the impact of various sound source signals on voice conversion accuracy. Various small sound source signals are designed by changing the spectral envelope and the waveform power independently, and we conduct objective and subjective evaluations. The results of these evaluations demonstrate that voice conversion accommodates 1) various sound source signals with different spectral envelopes and 2) a wide range of sound source signal power, provided that the power of the speech segments is not almost equal to that of the silent segments. Moreover, we also investigate the effectiveness of enhancing auditory feedback while speaking with the extremely small sound source signals.

  • A Microphone Array-Based 3-D N-Best Search Method for Recognizing Multiple Sound Sources

    Panikos HERACLEOUS, Satoshi NAKAMURA, Takeshi YAMADA, Kiyohiro SHIKANO
    PAPER-Speech and Hearing
    Vol: E85-D No:6, Page(s): 994-1002

    This paper describes a method for hands-free speech recognition, and in particular for the simultaneous recognition of multiple sound sources. The method is based on the 3-D Viterbi search, extended to a 3-D N-best search method that enables the recognition of multiple sound sources. The baseline system integrates two existing technologies, the 3-D Viterbi search and the conventional N-best search, into a complete system. A previous evaluation of the 3-D N-best search-based system showed that new ideas were necessary to develop a system for the simultaneous recognition of multiple sound sources, and identified two factors that play important roles in the performance of the system: the different likelihood ranges of the sound sources and the direction-based separation of the hypotheses. To address these problems, we implemented likelihood normalization and a path-distance-based clustering technique in the baseline 3-D N-best search-based system. The performance of our system was evaluated through experiments on simulated data for the case of two talkers. The experiments showed significant improvements from these two techniques. The best results were obtained by implementing both techniques and using a microphone array composed of 32 channels: the Word Accuracy for the two talkers was higher than 80%, and the Simultaneous Word Accuracy (where both sources are correctly recognized simultaneously) was higher than 70%, which are very promising results.

  • Interface for Barge-in Free Spoken Dialogue System Combining Adaptive Sound Field Control and Microphone Array

    Tatsunori ASAI, Hiroshi SARUWATARI, Kiyohiro SHIKANO
    LETTER-Speech and Hearing
    Vol: E88-A No:6, Page(s): 1613-1618

    This paper describes a new interface for a barge-in free spoken dialogue system combining adaptive sound field control and a microphone array. To achieve robustness against changes in the transfer functions caused by various interferences, a barge-in free spoken dialogue system using sound field control and a microphone array was previously proposed by one of the authors. However, that method cannot follow changes in the transfer functions because it consists of fixed filters. To solve this problem, we introduce a new adaptive sound field control that follows changes in the transfer functions.

  • A Self-Generator Method for Initial Filters of SIMO-ICA Applied to Blind Separation of Binaural Sound Mixtures

    Tomoya TAKATANI, Satoshi UKAI, Tsuyoki NISHIKAWA, Hiroshi SARUWATARI, Kiyohiro SHIKANO
    PAPER-Blind Source Separation
    Vol: E88-A No:7, Page(s): 1673-1682

    In this paper, we address the blind separation problem for binaural mixed signals and propose a novel blind separation method that implements a self-generator for the initial filters of Single-Input-Multiple-Output-model-based independent component analysis (SIMO-ICA). The original SIMO-ICA, previously proposed by the authors, can separate mixed signals not into monaural source signals but into SIMO-model-based signals from independent sources as they are observed at the microphones. Although this attractive feature of SIMO-ICA is beneficial for binaural sound separation, the current SIMO-ICA has a serious drawback: high sensitivity to the initial settings of the separation filter. In the proposed method, the self-generator for the initial filter functions as a preprocessor for SIMO-ICA and thus provides a valid initial filter. The self-generator is still a blind process because it mainly consists of a frequency-domain ICA (FDICA) part and a direction-of-arrival estimation part driven by the separated outputs of the FDICA. To evaluate its effectiveness, binaural sound separation experiments are carried out under a reverberant condition. The experimental results reveal that the separation performance of the proposed method is superior to that of conventional methods.

  • Music Signal Separation Based on Supervised Nonnegative Matrix Factorization with Orthogonality and Maximum-Divergence Penalties

    Daichi KITAMURA, Hiroshi SARUWATARI, Kosuke YAGI, Kiyohiro SHIKANO, Yu TAKAHASHI, Kazunobu KONDO
    LETTER-Engineering Acoustics
    Vol: E97-A No:5, Page(s): 1113-1118

    In this letter, we address monaural source separation based on supervised nonnegative matrix factorization (SNMF) and propose a new penalized SNMF. Conventional SNMF often degrades the separation performance owing to the basis-sharing problem. Our penalized SNMF forces the nontarget bases to differ from the target bases, which increases the quality of the separated sound.
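
    The mechanism can be sketched with multiplicative updates. The letter's actual formulation uses divergence-based orthogonality and maximum-divergence penalties; the sketch below substitutes a squared-Euclidean objective with a single orthogonality-style penalty mu*||F.T @ G||^2 that pushes the nontarget bases G away from the fixed target bases F, purely to show the idea.

```python
import numpy as np

def penalized_snmf(Y, F, n_basis=10, mu=1.0, n_iter=200, eps=1e-12):
    """Supervised NMF sketch: Y ~ F@U + G@V with fixed target bases F.
    The penalty mu*||F.T @ G||^2 discourages basis sharing between F and G."""
    rng = np.random.default_rng(0)
    n_freq, n_time = Y.shape
    U = rng.random((F.shape[1], n_time))
    G = rng.random((n_freq, n_basis))
    V = rng.random((n_basis, n_time))
    for _ in range(n_iter):
        R = F @ U + G @ V                       # current model
        U *= (F.T @ Y) / (F.T @ R + eps)        # target activations
        R = F @ U + G @ V
        V *= (G.T @ Y) / (G.T @ R + eps)        # nontarget activations
        R = F @ U + G @ V
        G *= (Y @ V.T) / (R @ V.T + mu * (F @ (F.T @ G)) + eps)  # penalized
    return U, G, V

# Toy usage: F would be trained beforehand on clean target data (here random).
rng = np.random.default_rng(2)
F = rng.random((64, 8))
Y = rng.random((64, 100))
U, G, V = penalized_snmf(Y, F)
target_estimate = F @ U   # separated target spectrogram
```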

  • Blind Separation of Speech by Fixed-Point ICA with Source Adaptive Negentropy Approximation

    Rajkishore PRASAD, Hiroshi SARUWATARI, Kiyohiro SHIKANO
    PAPER-Blind Source Separation
    Vol: E88-A No:7, Page(s): 1683-1692

    This paper presents a study on the blind separation of convolutive mixtures of speech signals using a Frequency-Domain Independent Component Analysis (FDICA) algorithm based on maximizing the negentropy of the Time-Frequency Series of Speech (TFSS). Comparative studies on the negentropy approximation of TFSS using generalized higher-order statistics (HOS) of different nonquadratic, nonlinear functions are presented. A new nonlinear function based on the statistical modeling of TFSS by exponential power functions is also proposed. The standard error and bias in the approximation of the negentropy of TFSS by the different nonlinear functions, estimated using the sequential delete-one jackknifing method, together with their signal separation performance, demonstrate the superiority of the exponential-power-based nonlinear function. The proposed nonlinear function speeds up convergence with a slight improvement in separation quality under reverberant conditions.
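
    The comparison rests on the standard one-unit negentropy approximation J(y) ~ (E[G(y)] - E[G(v)])^2, where v is standard normal and G is a nonquadratic contrast function. The sketch below compares the common log-cosh contrast with a contrast derived from an exponential power log-density; the shape parameter and the Laplacian stand-in for TFSS data are assumptions, not the paper's estimates.

```python
import numpy as np

def negentropy_approx(y, G, n_gauss=100000, seed=0):
    """One-unit negentropy approximation: J(y) ~ (E[G(y)] - E[G(v)])**2,
    with v standard normal."""
    v = np.random.default_rng(seed).standard_normal(n_gauss)
    return (np.mean(G(y)) - np.mean(G(v))) ** 2

def G_logcosh(y):
    return np.log(np.cosh(y))

def G_exp_power(y, beta=0.5):
    # Contrast derived from an exponential power log-density (beta assumed).
    return -np.exp(-np.abs(y) ** beta)

rng = np.random.default_rng(3)
speech_like = rng.laplace(size=50000)      # super-Gaussian stand-in for TFSS
print(negentropy_approx(speech_like, G_logcosh))
print(negentropy_approx(speech_like, G_exp_power))
```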

  • Cost Reduction of Acoustic Modeling for Real-Environment Applications Using Unsupervised and Selective Training

    Tobias CINCAREK, Tomoki TODA, Hiroshi SARUWATARI, Kiyohiro SHIKANO
    PAPER-Acoustic Modeling
    Vol: E91-D No:3, Page(s): 499-507

    Development of an ASR application such as a speech-oriented guidance system for a real environment is expensive. Most of the cost is due to human labeling of newly collected speech data to construct the acoustic model for speech recognition. Employing existing models or sharing models across multiple applications is often difficult, because the characteristics of speech depend on various factors such as the possible users, their speaking style, and the acoustic environment. This paper therefore proposes a combination of unsupervised learning and selective training to reduce development costs. Unsupervised learning alone is problematic because of the task dependency of speech recognition and because automatic transcription of speech is error-prone. A theoretically well-defined approach to the automatic selection of high-quality, task-specific speech data from an unlabeled data pool is presented: only those unlabeled data which increase the model likelihood given the labeled data are employed for unsupervised training. The effectiveness of the proposed method is investigated with a simulation experiment constructing adult and child acoustic models for a speech-oriented guidance system. A completely human-labeled database containing real-environment data collected over two years is available for the development simulation. It is shown experimentally that selective training alleviates the problems of unsupervised learning, i.e., it is possible to select speech utterances of a certain speaker group while discarding noise inputs and utterances with lower recognition accuracy. The simulation experiment is carried out for several combinations of data collection and human transcription periods. It is found empirically that the proposed method is especially effective when only relatively few of the collected data can be labeled and transcribed by humans.
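
    The selection criterion can be sketched with a drastically simplified model: a pooled utterance is kept only if retraining on it increases the likelihood of the human-labeled data. Below, a single 1-D Gaussian stands in for the HMM acoustic model and the data are synthetic; the real method retrains HMMs on automatic transcriptions.

```python
import numpy as np
from scipy.stats import norm

def loglik(data, mu, sd):
    """Log-likelihood of data under a 1-D Gaussian (acoustic-model stand-in)."""
    return norm.logpdf(data, mu, sd).sum()

def select_data(labeled, pool):
    """Greedy likelihood-based selection: keep an unlabeled sample only if
    adding it to the training set raises the likelihood of the labeled data."""
    labeled = np.array(labeled)
    train = list(labeled)
    for x in pool:
        mu0, sd0 = np.mean(train), np.std(train)
        cand = train + [x]
        mu1, sd1 = np.mean(cand), np.std(cand)
        if loglik(labeled, mu1, sd1) > loglik(labeled, mu0, sd0):
            train.append(x)
    return train

rng = np.random.default_rng(4)
labeled = rng.normal(0.0, 1.0, 50).tolist()          # human-transcribed data
pool = np.concatenate([rng.normal(0.0, 1.0, 100),    # task-matched speech
                       rng.normal(6.0, 1.0, 100)])   # noise-like outliers
kept = select_data(labeled, pool.tolist())           # mostly task-matched data
```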

  • Robustness of Phoneme-Based HMMs against Speaking-Style Variations

    Tatsuo MATSUOKA, Kiyohiro SHIKANO
    PAPER-Phoneme Recognition and Word Spotting
    Vol: E74-A No:7, Page(s): 1761-1767

    In a practical continuous speech recognition system, the target speech is often spoken in a different speaking style (e.g., speed or loudness) from the training speech. It is difficult to cope with such speaking-style variations because the amount of training speech is limited. Therefore, acoustic modeling should be robust against different styles of speech in order to obtain high recognition performance from limited training speech. This paper describes the robustness of six types of phoneme-based HMMs against speaking-style variations. The six model types were VQ- and FVQ-based discrete HMMs, and single-Gaussian and mixture-Gaussian HMMs with either diagonal or full covariance matrices. They were investigated using isolated word utterances, phrase-by-phrase utterances, and fluently spoken utterances, with different utterance types for training and testing. The experimental results show that the mixture-Gaussian HMM with diagonal covariance matrices is the most promising choice. The FVQ-based HMM and the single-Gaussian HMM with full covariance matrices also achieved good results. The mixture-Gaussian HMM with full covariance matrices sometimes achieved very high accuracies, but often suffered from "overtuning" or a lack of training data. Finally, this paper proposes a new model-adaptation technique that combines multiple models with appropriate weighting factors. Each model has different characteristics (e.g., coverage of speaking styles and sensitivity to data), and the weighting factors can be estimated using "deleted interpolation". When the mixture-Gaussian diagonal-covariance models were used as baseline models, this technique achieved better recognition accuracy than a model trained on all three utterance types at once. The advantage of this technique is that estimating the weighting factors is stable even with a limited amount of training speech, because there are few free parameters to estimate.
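
    The weighting factors can be estimated with the usual EM procedure for interpolation weights on held-out (deleted) data; the sketch below shows that procedure on a per-frame model-likelihood matrix. The likelihood matrix is synthetic, and combining the three utterance-type models is the assumed use case.

```python
import numpy as np

def deleted_interpolation_weights(likelihoods, n_iter=100):
    """EM estimation of interpolation weights (deleted-interpolation sketch).
    likelihoods[t, m]: likelihood of held-out frame t under model m."""
    n_models = likelihoods.shape[1]
    w = np.full(n_models, 1.0 / n_models)         # start from uniform weights
    for _ in range(n_iter):
        post = w * likelihoods                    # responsibility numerators
        post /= post.sum(axis=1, keepdims=True)   # normalize per frame
        w = post.mean(axis=0)                     # re-estimate mixture weights
    return w

# Toy usage: three speaking-style models scored on held-out data (synthetic).
rng = np.random.default_rng(7)
lik = rng.random((500, 3)) + np.array([0.5, 0.1, 0.2])  # model 0 fits best
print(deleted_interpolation_weights(lik))               # largest weight first
```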

  • Theoretical Analysis of Amounts of Musical Noise and Speech Distortion in Structure-Generalized Parametric Blind Spatial Subtraction Array

    Ryoichi MIYAZAKI, Hiroshi SARUWATARI, Kiyohiro SHIKANO
    LETTER-Engineering Acoustics
    Vol: E95-A No:2, Page(s): 586-590

    We propose a structure-generalized blind spatial subtraction array (BSSA) and present a theoretical analysis of the amounts of musical noise and speech distortion. The structure of the BSSA should be selected according to the application: a channelwise BSSA is recommended for listening, whereas the conventional BSSA is suitable for speech recognition.

  • Integration of Speech Recognition and Language Processing in a Japanese to English Spoken Language Translation System

    Tsuyoshi MORIMOTO, Kiyohiro SHIKANO, Kiyoshi KOGURE, Hitoshi IIDA, Akira KUREMATSU
    PAPER-Speech Understanding
    Vol: E74-A No:7, Page(s): 1889-1896

    The experimental spoken language translation system (SL-TRANS) has been implemented. It can recognize Japanese speech, translate it into English, and output synthesized English speech. One of the most important problems in realizing such a system is how to integrate, or connect, speech recognition and language processing. In this paper, a new method realized in the system is described. The method is composed of three processes: grammar-driven predictive speech recognition, Kakariuke-dependency-based candidate filtering, and HPSG-based lattice parsing supplemented with a sentence preference mechanism. Input speech is uttered phrase by phrase. The speech recognizer takes an input phrase utterance and outputs several candidates with recognition scores for each phrase. A Japanese phrasal grammar is used in recognition; it contributes to the output of grammatically well-formed phrase candidates as well as to the reduction of phone perplexity. The candidate filter takes a phrase lattice, which is a sequence of multiple candidates for a phrase, and outputs a reduced phrase lattice, removing semantically inappropriate phrase candidates by applying the Kakariuke dependency relationship between phrases. Finally, the HPSG-based lattice parser takes a phrase lattice and chooses the most plausible sentence by checking syntactic and semantic legitimacy and evaluating sentential preference. Experimental results for the system are also reported, and the usefulness of the method is confirmed.

  • Probability Distribution of Time-Series of Speech Spectral Components

    Rajkishore PRASAD, Hiroshi SARUWATARI, Kiyohiro SHIKANO
    PAPER-Audio/Speech Coding
    Vol: E87-A No:3, Page(s): 584-597

    This paper deals with the statistical modeling of a Time-Frequency Series of Speech (TFSS), obtained by Short-Time Fourier Transform (STFT) analysis of the speech signal picked up by a linear microphone array with two elements. We attempt to find the closest match between the distribution of the TFSS and theoretical distributions such as the Laplacian distribution (LD), Gaussian distribution (GD), and generalized Gaussian distribution (GGD), with parameters estimated from the TFSS data. It is found that the GGD provides the best models for the real part, imaginary part, and polar magnitude of the time series of the spectral components. The distribution of the polar magnitude is closer to LD than those of the real and imaginary parts, while the distributions of the real and imaginary parts of the TFSS are strongly Laplacian. The phase of the TFSS is found to be uniformly distributed. Using the GGD-based model as the PDF in fixed-point Frequency-Domain Independent Component Analysis (FDICA) provides better separation performance and significantly improves convergence speed.
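
    A GGD fit of the kind used for such modeling is commonly obtained by moment matching: the ratio E|x| / sqrt(E[x^2]) uniquely determines the shape parameter beta (beta = 1 is Laplacian, beta = 2 is Gaussian). A minimal sketch, with synthetic data in place of TFSS:

```python
import numpy as np
from scipy.special import gamma
from scipy.optimize import brentq

def ggd_shape(x):
    """Moment-matching estimate of the GGD shape parameter beta, using the
    ratio E|x| / sqrt(E[x^2]); beta=1 is Laplacian, beta=2 is Gaussian."""
    r = np.mean(np.abs(x)) / np.sqrt(np.mean(x ** 2))
    M = lambda b: gamma(2.0 / b) / np.sqrt(gamma(1.0 / b) * gamma(3.0 / b))
    return brentq(lambda b: M(b) - r, 0.1, 10.0)   # M is monotonic in beta

rng = np.random.default_rng(5)
print(ggd_shape(rng.laplace(size=100000)))     # ~1.0 (Laplacian)
print(ggd_shape(rng.standard_normal(100000)))  # ~2.0 (Gaussian)
```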

  • An Iterative Inverse Filter Design Method for the Multichannel Sound Field Reproduction System

    Yosuke TATEKURA, Hiroshi SARUWATARI, Kiyohiro SHIKANO
    PAPER
    Vol: E84-A No:4, Page(s): 991-998

    To achieve a sound field reproduction system, it is important to design multichannel inverse filters that cancel the effects of the room transfer functions. A design method in the frequency domain based on the least-norm solution (LNS) requires less memory and less computation than design methods in the time domain; however, the LNS method cannot guarantee the causality or stability of the filters. In this paper, a method for designing a time-domain inverse filter using iterative processing in the frequency domain for multichannel sound field reproduction is proposed, and the results of a numerical analysis are described. The proposed method decreases the squared error at every control point by 3-12 dB. Furthermore, for real-environment impulse responses, sound reproduced with the proposed filters attains over 13 dB improvement in segmental signal-to-noise ratio (SNR) compared with filters designed by the LNS method.
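
    The general pattern of such iterative designs can be sketched in the single-channel case: start from a regularized spectral inverse, repeatedly enforce finite causal support in the time domain, and correct the residual in the frequency domain. The regularization constant, step size, iteration scheme, and toy impulse response below are all assumptions for illustration; the paper addresses the multichannel problem.

```python
import numpy as np

def iterative_inverse_filter(h, n_fft=1024, n_iter=50, mu=0.5, reg=1e-3):
    """Iteratively design a causal FIR inverse filter g for impulse response h,
    refining in the frequency domain and truncating in the time domain.
    (Single-channel sketch; a modeling delay is often added in practice.)"""
    H = np.fft.rfft(h, n_fft)
    G = np.conj(H) / (np.abs(H) ** 2 + reg)    # regularized initial inverse
    for _ in range(n_iter):
        g = np.fft.irfft(G, n_fft)
        g[n_fft // 2:] = 0.0                   # enforce finite, causal support
        G_t = np.fft.rfft(g, n_fft)
        E = 1.0 - H * G_t                      # residual from the ideal H*G = 1
        G = G_t + mu * np.conj(H) * E / (np.abs(H) ** 2 + reg)
    g = np.fft.irfft(G, n_fft)
    g[n_fft // 2:] = 0.0
    return g[:n_fft // 2]

h = np.zeros(256); h[0] = 1.0; h[40] = 0.5; h[97] = 0.25  # toy room response
g = iterative_inverse_filter(h)
eq = np.convolve(h, g)                         # should approximate a delta
```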

  • Designing Target Cost Function Based on Prosody of Speech Database

    Kazuki ADACHI, Tomoki TODA, Hiromichi KAWANAMI, Hiroshi SARUWATARI, Kiyohiro SHIKANO
    PAPER-Speech Synthesis and Prosody
    Vol: E88-D No:3, Page(s): 519-524

    This research aims to construct a high-quality Japanese TTS (Text-to-Speech) system with high flexibility in treating prosody. Many TTS systems implement prosody control, but such systems have fundamentally been designed to output speech with a standard pitch and speech rate. In this study, we employ a unit selection-concatenation method and also introduce an analysis-synthesis process to provide precisely controlled prosody in the output speech. Speech quality degrades in proportion to the amount of prosody modification; therefore, a target cost for prosody is used to evaluate the prosodic difference between the target prosody and speech candidates in such a unit selection system. However, the conventional cost ignores the original prosody of the speech segments, although the quality degradation tendency can be assumed to vary with the pitch or speech rate of the original speech. In this paper, we propose a novel cost function design based on the prosody of the speech segments. First, we recorded nine databases of Japanese speech with different prosodic characteristics. Then, for these speech databases, we investigated the relationships between the amount of prosody modification and the perceptual degradation. The results indicate that the tendency of perceptual degradation differs according to the prosodic features of the original speech. On the basis of these results, we propose a new cost function design that changes the cost function according to the prosody of the speech database. Preference tests on synthetic speech show that the proposed cost functions generate speech of higher quality than the conventional method.
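
    The database-dependent cost can be sketched as a per-database weight on the amount of prosody modification: the same F0 shift costs more when the candidate comes from a database whose speech degrades faster under modification. The weights, database names, and candidate values below are invented for illustration, not the paper's measured degradation curves.

```python
import numpy as np

# Illustrative per-database weights: perceptual degradation per unit of F0
# modification depends on the original prosody, so each database gets its
# own slope (values made up for this sketch).
DB_WEIGHTS = {"high_pitch": 1.4, "normal_pitch": 1.0, "low_pitch": 1.2}

def target_cost(f0_target, candidate_f0, candidate_db):
    """Prosody-dependent target cost for one unit candidate: the same amount
    of F0 modification costs more when the source database degrades faster."""
    modification = abs(np.log2(f0_target / candidate_f0))  # octaves of shift
    return DB_WEIGHTS[candidate_db] * modification

# Pick the cheaper of two hypothetical candidates for a 220 Hz target.
candidates = [(200.0, "normal_pitch"), (260.0, "high_pitch")]
best = min(candidates, key=lambda c: target_cost(220.0, *c))
```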
